% Advanced FPGA Design - Exercise 1
% Norbert Tremurici (e11907086@student.tuwien.ac.at)
% 25th November, 2023

# Advanced FPGA Design

For this exercise, we had to implement an asynchronous FIFO to be used for clock domain crossing.
We based our implementation on the textbook for this course, Steve Kilts' "Advanced FPGA Design", which explains the challenges, uses and tricks behind such a design quite well.

A block diagram is shown in Figure \ref{fig:block-diagram}: each of the two clock domains contains a generic system that runs at its own frequency and writes to or reads from the asynchronous FIFO.

![Simple Block Diagram\label{fig:block-diagram}](graphics/block-diagram.pdf)

\newpage

#### Circuit Design

The achievable throughput is limited by the frequency of the slower side, in this case the 100 MHz of the read domain.
To sustain 100% of this limit, we dimension our FIFO to have a depth of at least $2^4 = 16$ elements.
Our FIFO uses internal block RAM, so this depth can easily be provided; we could even choose a larger size.
Internally, the implementation cycles through all addresses of the block RAM; the address width is a parameter, which is set to four in our top module and simulation.
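The addressing described above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: it runs in a single clock domain, the class name `RingFifo` is hypothetical, and the use of one extra pointer bit to tell "full" from "empty" is a common technique that we assume here, not necessarily the one used in our RTL.

```python
ADDR_WIDTH = 4  # address width used in the top module and simulation


class RingFifo:
    """Single-clock behavioural model of the FIFO addressing (hypothetical helper)."""

    def __init__(self, addr_width=ADDR_WIDTH):
        self.depth = 1 << addr_width          # 2**4 = 16 elements
        self.mem = [0] * self.depth           # stands in for the block RAM
        # Pointers carry one extra bit (assumption) so full and empty differ.
        self.wr_ptr = 0
        self.rd_ptr = 0

    def _addr(self, ptr):
        # The lower addr_width bits select the RAM address.
        return ptr & (self.depth - 1)

    def _inc(self, ptr):
        # Pointers wrap after 2 * depth increments.
        return (ptr + 1) & (2 * self.depth - 1)

    @property
    def empty(self):
        return self.wr_ptr == self.rd_ptr

    @property
    def full(self):
        # Same address bits, different extra bit.
        return (self.wr_ptr ^ self.rd_ptr) == self.depth

    def write(self, value):
        if self.full:
            return False
        self.mem[self._addr(self.wr_ptr)] = value
        self.wr_ptr = self._inc(self.wr_ptr)
        return True

    def read(self):
        if self.empty:
            return None
        value = self.mem[self._addr(self.rd_ptr)]
        self.rd_ptr = self._inc(self.rd_ptr)
        return value
```

With `addr_width=4`, sixteen writes succeed before `full` is raised, and the values are read back in the order in which they were written.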

For the top module, we assumed that the signals `wr_clk`, `wr_en` and `wr_data` are provided externally, so they are assigned to pins in the design.
A `set_input_delay` constraint was set on `wr_data` and `wr_en` to associate them with the clock `wr_clk`.
The clock `wr_clk` is constrained to have a 50% duty cycle and 200 MHz frequency (clock period of 5 ns).

#### Optimized Reset

The reset signal is also assigned to a pin, namely that of the center push button of the ZedBoard.
Because this input is asynchronous, a `set_false_path` constraint was placed on it to exclude it from timing analysis.
To deal with potential metastability, the reset signal passes through a 2-stage synchronizer (3 registers), whose output also provides us with a synchronous reset signal.
The synchronous reset allows us to optimize the circuit, as it lets us use flip-flops with synchronous resets.
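The latency introduced by the synchronizer can be sketched with a cycle-level model of the register chain. This is a simplification under stated assumptions: metastability itself is not modelled, the reset is assumed to be active high, all registers are assumed to start at zero, and the class name `ResetSynchronizer` is hypothetical.

```python
class ResetSynchronizer:
    """Cycle-level model of a register chain in the destination clock domain."""

    def __init__(self, num_registers=3):
        # Assumption: all registers start deasserted (0).
        self.regs = [0] * num_registers

    def clock(self, async_reset):
        """Apply one rising clock edge and return the synchronized reset."""
        # Each register takes the value of its predecessor;
        # the first one samples the asynchronous input.
        self.regs = [int(bool(async_reset))] + self.regs[:-1]
        return self.regs[-1]


def trace(inputs, num_registers=3):
    """Return the synchronizer output after each clock edge."""
    sync = ResetSynchronizer(num_registers)
    return [sync.clock(value) for value in inputs]
```

For example, `trace([1, 1, 1, 1, 0, 0, 0, 0])` yields `[0, 0, 1, 1, 1, 1, 0, 0]`: with three registers, both assertion and release of the button reach the output on the third clock edge.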

On the read side, we drive 5 on-board LEDs.
We have chosen to forward the `full` output of the FIFO to LED 4 (but also to a dedicated pin for the write side).
The other 4 LEDs show a block-wise `xor` reduction of the 16-bit `rd_data` output of the asynchronous FIFO, so that each block of four bits drives one of the four LEDs.
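The block-wise reduction can be sketched as follows. The function name `xor_reduce_blocks` is hypothetical, and the mapping of the lowest four bits to LED 0 is an assumption; the report only states that each block of four bits corresponds to one LED.

```python
def xor_reduce_blocks(rd_data):
    """Return four LED values, one per 4-bit block of a 16-bit word."""
    leds = []
    for i in range(4):
        block = (rd_data >> (4 * i)) & 0xF   # assumed mapping: bits [4i+3:4i] -> LED i
        parity = bin(block).count("1") & 1   # xor of all four bits in the block
        leds.append(parity)
    return leds
```

For the word `0xF731`, the blocks from least to most significant are `0x1`, `0x3`, `0x7` and `0xF`, which gives the LED values `[1, 0, 1, 0]`.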

Because we optimized the reset, only one type of flip-flop is used, namely flip-flops with synchronous resets.

Below we provide two exported schematics (in Figure \ref{fig:top} and Figure \ref{fig:top-expanded}) showing the design of the top module and also an expanded view with the components of the asynchronous FIFO module.

![RTL Schematic of `top` module\label{fig:top}](graphics/schematic-top.pdf)

![Expanded view of RTL Schematic of `top` module\label{fig:top-expanded}](graphics/schematic-top-expanded.pdf)

#### Optimizations

First, all automatic optimizations performed by Vivado were disabled.
To target both performance and area, the following optimization settings were then set manually (after some experimentation, tweaking and comparing):

- `flatten_hierarchy` to `full`: this optimization removes the hierarchy boundaries of the design, which can be seen quite clearly in the post-synthesis view
- `gated_clock_conversion` to `on`: this optimization allows synthesis to convert gated clocks into clock enables; the post-synthesis view shows that it is actually applied to the registers
- `retiming` to `on`: this optimization balances our registers, which can also be seen to have an effect in the post-synthesis view
- `fsm_extraction` to `one_hot`: this optimization is not actually used, as there are no FSMs, but if we had any, we would want them encoded this way, so it has been set anyway
- `resource_sharing` to `on`: this optimization reduces the area consumption of the design
- `cascade_dsp` to `force`: this optimization removes the adder tree restrictions imposed by the structure of our RTL design and uses DSP block adders where appropriate

#### Simulations

The simulation is structured like the top module, with one side writing whenever possible at a frequency of 200 MHz and one side reading whenever possible at a frequency of 100 MHz.
A trace showing successful back-to-back transmission can be seen in Figure \ref{fig:sim}.
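The expected behaviour of this setup can be approximated with a simple time-stepped model. This is only a sketch, not our testbench: it ignores the pointer synchronization latency of the real FIFO, assumes that the read period is an integer multiple of the write period, and the function name `simulate` is hypothetical.

```python
from collections import deque


def simulate(depth=16, wr_period_ns=5, rd_period_ns=10, duration_ns=1000):
    """Writer and reader both transfer whenever possible; returns statistics."""
    fifo = deque()
    next_value = 0      # next word the writer wants to send
    received = []       # words seen by the reader, in order
    rd_stalls = 0       # read edges with an empty FIFO after the first write

    # Step in units of the (faster) write clock period.
    for t in range(0, duration_ns, wr_period_ns):
        # Read edge: only every rd_period_ns.
        if t % rd_period_ns == 0:
            if fifo:
                received.append(fifo.popleft())
            elif next_value > 0:
                rd_stalls += 1
        # Write edge: every step, blocked while the FIFO is full.
        if len(fifo) < depth:
            fifo.append(next_value)
            next_value += 1

    return {
        "written": next_value,
        "received": received,
        "rd_stalls": rd_stalls,
        "occupancy": len(fifo),
    }
```

In this model the reader never stalls once data is available, so the read side runs back to back, while the FIFO fills up and the writer is throttled by `full` to the rate of the reader.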

![Simulation trace (orange=read side, yellow=write side)\label{fig:sim}](graphics/sim.png)

#### Constraints

The full constraints file used in this design has been provided as an additional submission (as have the schematics and the simulation trace), but we also reproduce it here:

`fifo_async.xdc`

```tcl
# clock constraint
create_clock -name rd_clk -period 10 [get_ports {rd_clk}];
create_clock -name wr_clk -period 5 [get_ports {wr_clk}];

# onboard clock port
set_property PACKAGE_PIN Y9 [get_ports {rd_clk}];  # "GCLK"

# PMOD A port
set_property PACKAGE_PIN Y11  [get_ports {wr_en}];  # "JA1"
set_property PACKAGE_PIN AA11 [get_ports {full}];  # "JA2"
set_property PACKAGE_PIN AA9  [get_ports {wr_clk}];  # "JA4"

# PMOD B port
set_property PACKAGE_PIN W12 [get_ports {wr_data[0]}];  # "JB1"
set_property PACKAGE_PIN W11 [get_ports {wr_data[1]}];  # "JB2"
set_property PACKAGE_PIN V10 [get_ports {wr_data[2]}];  # "JB3"
set_property PACKAGE_PIN W8 [get_ports {wr_data[3]}];  # "JB4"
set_property PACKAGE_PIN V12 [get_ports {wr_data[4]}];  # "JB7"
set_property PACKAGE_PIN W10 [get_ports {wr_data[5]}];  # "JB8"
set_property PACKAGE_PIN V9 [get_ports {wr_data[6]}];  # "JB9"
set_property PACKAGE_PIN V8 [get_ports {wr_data[7]}];  # "JB10"

# PMOD C port
set_property PACKAGE_PIN AB6 [get_ports {wr_data[8]}];  # "JC1_N"
set_property PACKAGE_PIN AB7 [get_ports {wr_data[9]}];  # "JC1_P"
set_property PACKAGE_PIN AA4 [get_ports {wr_data[10]}];  # "JC2_N"
set_property PACKAGE_PIN Y4  [get_ports {wr_data[11]}];  # "JC2_P"
set_property PACKAGE_PIN T6  [get_ports {wr_data[12]}];  # "JC3_N"
set_property PACKAGE_PIN R6  [get_ports {wr_data[13]}];  # "JC3_P"
set_property PACKAGE_PIN U4  [get_ports {wr_data[14]}];  # "JC4_N"
set_property PACKAGE_PIN T4  [get_ports {wr_data[15]}];  # "JC4_P"

# LEDs
set_property PACKAGE_PIN T22 [get_ports {led[0]}];  # "LD0"
set_property PACKAGE_PIN T21 [get_ports {led[1]}];  # "LD1"
set_property PACKAGE_PIN U22 [get_ports {led[2]}];  # "LD2"
set_property PACKAGE_PIN U21 [get_ports {led[3]}];  # "LD3"
set_property PACKAGE_PIN V22 [get_ports {led[4]}];  # "LD4"

# push buttons
set_property PACKAGE_PIN P16 [get_ports {reset}];  # "BTNC"

set_property IOSTANDARD LVCMOS33 [get_ports -of_objects [get_iobanks 33]];
set_property IOSTANDARD LVCMOS18 [get_ports -of_objects [get_iobanks 34]];
# bank 35 not used
#set_property IOSTANDARD LVCMOS18 [get_ports -of_objects [get_iobanks 35]];
set_property IOSTANDARD LVCMOS33 [get_ports -of_objects [get_iobanks 13]];

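# clock domain crossing: bound the datapath delay from wr_clk to rd_clk
# (note: the clock-to-clock set_false_path constraints below take precedence over set_max_delay)
set_max_delay 5 -datapath_only -from [get_clocks wr_clk] -to [get_clocks rd_clk]

# input timing of the write side relative to wr_clk
set_input_delay 2 -clock [get_clocks wr_clk] [get_ports wr_en]
set_input_delay 2 -clock [get_clocks wr_clk] [get_ports wr_data]

# asynchronous reset input and untimed outputs
set_false_path -from [get_ports reset]
set_false_path -to [get_ports full]
set_false_path -to [get_ports led]

# the two clock domains are asynchronous to each other
set_false_path -from [get_clocks rd_clk] -to [get_clocks wr_clk]
set_false_path -from [get_clocks wr_clk] -to [get_clocks rd_clk]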
```

#### Results

Figure \ref{fig:utilization} shows the post-synthesis utilization of the design.
Only one LUT is used as memory; the rest implement our combinational logic.
As can be seen, internal block RAM is used for the FIFO.

![Utilization Report Post Synthesis\label{fig:utilization}](graphics/utilization-post-synthesis.png)

Figure \ref{fig:timing} shows the post-synthesis timing (intra-clock paths) of the design.
The constraints we used have improved our timing and, in particular, have removed any violations on inter-clock paths.

Our maximum frequencies can be calculated from the clock period $T$ and the worst negative slack $WNS$ (both in ns) as follows:

$$
\begin{aligned}
f_{max,rd} &= \frac{1}{10^{-9} \cdot (T_{rd} - WNS_{rd})} \cdot 10^{-6}\ \text{MHz} \\
&= \frac{10^3}{T_{rd} - WNS_{rd}}\ \text{MHz} \\
&= \frac{10^3}{10 - 6.934}\ \text{MHz} \\
&\approx 326.158\ \text{MHz} \\
\end{aligned}
$$

$$
\begin{aligned}
f_{max,wr} &= \frac{1}{10^{-9} \cdot (T_{wr} - WNS_{wr})} \cdot 10^{-6}\ \text{MHz} \\
&= \frac{10^3}{T_{wr} - WNS_{wr}}\ \text{MHz} \\
&= \frac{10^3}{5 - 1.356}\ \text{MHz} \\
&\approx 274.424\ \text{MHz} \\
\end{aligned}
$$
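The same calculation can be scripted to check the numbers. The function name `fmax_mhz` is hypothetical; the period and slack values are the ones used in the equations above, both in ns.

```python
def fmax_mhz(period_ns, wns_ns):
    """Maximum frequency in MHz from clock period and worst negative slack (ns)."""
    # The shortest achievable period is the constrained period minus the slack;
    # 1 / ns equals GHz, so multiply by 1000 to get MHz.
    return 1e3 / (period_ns - wns_ns)


fmax_rd = fmax_mhz(10.0, 6.934)  # read domain, approximately 326.158 MHz
fmax_wr = fmax_mhz(5.0, 1.356)   # write domain, approximately 274.424 MHz
```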

![Timing Report Post Synthesis\label{fig:timing}](graphics/timing-post-synthesis.png)

As can be seen, we have solved the problem with a design that is quite small in terms of resources, while also meeting our timing constraints with considerable margin.
If we wanted to achieve even higher performance, we could extend the design using pipelining.
